Day 13: Getting Started with Transformers Library
Hugging Face Transformers is a library that gives you access to thousands of pre-trained models with just a few lines of code. Today we will walk through everything from installation to the library's core feature, pipeline().
Installation and Basic Setup
# Installation
# pip install transformers torch
from transformers import pipeline
# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was really fun!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Process multiple sentences at once
texts = ["The weather is nice today", "The service was very rude"]
results = classifier(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.4f})")
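When you omit the model argument, pipeline() falls back on a library default and prints a warning. Pinning an explicit checkpoint makes results reproducible across library versions; the sketch below assumes distilbert-base-uncased-finetuned-sst-2-english, which at the time of writing is the default checkpoint for sentiment analysis (defaults can change between releases).

```python
from transformers import pipeline

# Pin a specific checkpoint instead of relying on the library default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("This movie was really fun!"))
```

The first call downloads and caches the checkpoint; subsequent runs load it from the local cache.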
Performing Various Tasks with pipeline()
A single pipeline() function can handle various NLP tasks including translation, summarization, and text generation. Internally, it automatically downloads the model and tokenizer.
from transformers import pipeline
# Text summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
AI technology is rapidly advancing and affecting various industries.
In particular, large language models are demonstrating human-level
performance in text generation, translation, code writing, and other areas.
Companies are leveraging these technologies to improve work efficiency
and develop new services.
"""
summary = summarizer(article, max_length=50, min_length=10)
print(summary[0]["summary_text"])
# Translation (English -> French)
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("Hello, how are you today?"))
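Question answering is another task the same function covers: the model extracts an answer span from a context passage. A minimal sketch, assuming the distilbert-base-cased-distilled-squad checkpoint (a commonly used extractive QA model):

```python
from transformers import pipeline

# Extractive QA: the model selects an answer span from the given context
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What does the pipeline download automatically?",
    context="The pipeline function automatically downloads the model and tokenizer.",
)
print(result["answer"], result["score"])
```

The result dictionary also includes start and end character offsets of the answer within the context.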
Fine-Grained Control with AutoModel and AutoTokenizer
While pipeline() is convenient, loading the model and tokenizer directly gives you more fine-grained control.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(f"Model is running on {device}.")
# Tokenize and pass directly to the model
inputs = tokenizer("Transformers library is amazing!", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=-1).item()
print(f"Predicted class: {predicted_class}")
# Note: bert-base-uncased ships without a fine-tuned classification head,
# so the head here is randomly initialized and the predicted class is not
# meaningful until the model is fine-tuned on a labeled dataset.
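Logits are raw, unnormalized scores. If you want probabilities rather than just the argmax, apply a softmax over the class dimension; a minimal sketch using only torch, with made-up example logits for a 2-class head:

```python
import torch
import torch.nn.functional as F

# Example logits for a 2-class head (in practice: outputs.logits)
logits = torch.tensor([[1.2, -0.8]])

# Softmax converts raw scores into probabilities that sum to 1
probs = F.softmax(logits, dim=-1)
print(probs)               # ≈ tensor([[0.8808, 0.1192]])
print(probs.sum().item())  # 1.0
```

The argmax of the probabilities is always the same as the argmax of the logits; the softmax only matters when you need calibrated-looking scores or want to inspect the margin between classes.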
Use pipeline() for rapid prototyping, and the AutoModel/AutoTokenizer combination when you need custom logic. Choose the appropriate approach based on the situation.
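There is also a middle ground between the two approaches: load the model and tokenizer yourself, then wrap them in a pipeline(), keeping control over loading while retaining the convenient call interface. A sketch, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load the components explicitly, then hand them to pipeline()
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("Combining both approaches works well."))
```

This pattern is handy when you want to modify the model or tokenizer (for example, moving the model to a specific device) before wrapping it.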
Today’s Exercises
- Use pipeline("zero-shot-classification") to classify any news article text into the categories "Politics", "Economy", "Sports", and "Technology".
- Use pipeline("text-generation") to generate 3 continuations for the prompt "The future of artificial intelligence is". Utilize the num_return_sequences parameter.
- Tokenize 3 English sentences with AutoTokenizer and print each sentence's token count and token list. Use the model bert-base-multilingual-cased.